Abstract

We consider the question “How should one act when the only goal is to learn as much as possible?” Building
on the theoretical results of Fedorov [1972] and MacKay [1992], we apply techniques from Optimal
Experiment Design (OED) to guide the query/action selection of a neural network learner. We demonstrate
that these techniques allow the learner to minimize its generalization error by exploring its domain
efficiently and completely. We conclude that, while not a panacea, OED-based query/action has much to
offer, especially in domains where its high computational costs can be tolerated.
1 Introduction


In many natural learning problems, the learner has the 
ability to act on its environment and gather data that 
will resolve its uncertainties. Most machine learning research,
however, treats the learner as a passive recipient
of data and ignores the role of this “active” component 
of learning. In this paper we employ techniques from 
the field of Optimal Experiment Design (OED) to guide 
the actions of a learner, selecting actions/queries that 
are statistically expected to minimize its uncertainty and 
error. 



[Figure 1, first panel: passive learning]


1.1 Active learning 

Exploiting the active component of learning typically
leads to improved generalization, usually at the cost of
additional computation (see Figure 1) [Angluin, 1982;
Cohn et al., 1990; Hwang et al., 1991] (footnote 1). There are two
common situations where this tradeoff is desirable. In
many situations the cost of taking an action outweighs
the cost of the computation required to incorporate new
information into the model. In these cases we wish to
select queries judiciously, so that we can build a good
model with the fewest data. This is the case if, for example,
we are drilling oil wells or taking seismic measurements
to locate buried waste. In other situations the
data, although cheap, must be chosen carefully to ensure
thorough exploration. Large amounts of data may
be useless if they all come from an uninteresting part
of the domain. This is the case with learning control
of a robot arm: exploring by generating random motor
torques cannot be expected to give good coverage of the
domain.

As computation becomes cheaper and faster, more
problems fall within the realm where it is both desirable
and practical to pursue active learning, expending more
computation to ensure that one’s exploration provides
good data. The field of Optimal Experiment Design,
which is concerned with the statistics of gathering new
data, provides a principled way to guide this exploration.
This paper builds on the theoretical results of Fedorov
[1972] and MacKay [1992] to empirically demonstrate
how OED may be applied to neural network learning,
and to determine under what circumstances it is an
effective approach.

The remainder of this section provides a formal problem
definition, followed by a brief review of related work
using optimal experiment design. Section 2 differentiates
several classes of active learning problems for which OED
is appropriate. Section 3 describes the theory behind
optimal experiment design, and Section 4 demonstrates
its application to the problems described in Section 2.
Section 5 considers the computational costs of these experiments,
and Section 6 concludes with a discussion of
the results and implications for future work.


1 In some cases active selection of training data can
sharply reduce worst-case computational complexity from
NP-complete to polynomial time [Baum and Lang, 1991], and
in special cases to linear time.



[Figure 1, second panel: active learning]

Figure 1: An active system will typically evaluate/train
on its data iteratively, determining its next input based
on the previous training examples. This iterative training
may be computationally expensive, especially for
learning systems like neural networks where good incremental
algorithms are not available.

1.2 Problem definition 

We consider the problem of learning an input-output
mapping $X \to Y$ from a set of $m$ training examples
$\{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in X$, $y_i \in Y$.

We denote the parameterized learner $f_w(\cdot)$, where
$\hat y = f_w(x)$ is the learner’s output given input $x$ and
parameter vector $w$. The learner is trained by adjusting
$w$ to minimize the residual
$S^2 = \frac{1}{m}\sum_{i=1}^{m} (f_w(x_i) - y_i)^T (f_w(x_i) - y_i)$
on the training set. Let $\hat w$ be the
weight vector that minimizes $S^2$. Then $\hat y = f_{\hat w}(x)$ is
the learner’s “best guess” of the mapping $X \to Y$: given
$x$, $\hat y$ is an estimate of the corresponding $y$.

At each time step, the learner is allowed to select a
new training input $\tilde x$ from a set of candidate inputs $\tilde X$.
The selection of $\tilde x$ may be viewed as a “query” (as to an
oracle), as an “experiment,” or simply as an “action.”
Having selected $\tilde x$, the learner is given the corresponding
$\tilde y$, and the resulting new example $(\tilde x, \tilde y)$ is added to
the training set. The learner incorporates the new data,
selects another new $\tilde x$, and the process is repeated.
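The query cycle just described can be sketched as a short loop. This is a minimal illustration and not the paper's implementation: the function names (`oracle`, `train`, `select_query`), the linear toy oracle, the mean predictor, and the farthest-point selection rule are all hypothetical stand-ins chosen to keep the sketch self-contained.

```python
def active_learning_loop(candidates, oracle, select_query, train, n_queries):
    """Generic query loop: select x, observe y, add (x, y), retrain."""
    training_set = []
    model = train(training_set)               # initial model, no data yet
    for _ in range(n_queries):
        x = select_query(model, training_set, candidates)
        y = oracle(x)                         # the environment supplies y
        training_set.append((x, y))
        model = train(training_set)           # incorporate the new example
    return model, training_set

# Hypothetical toy components (assumptions, not from the paper).
def oracle(x):
    return 2.0 * x + 1.0

def train(data):
    """Stand-in learner: predicts the mean of the observed outputs."""
    if not data:
        return lambda x: 0.0
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def select_query(model, data, candidates):
    """Stand-in selection rule: query farthest from all previous inputs."""
    if not data:
        return candidates[0]
    seen = [x for x, _ in data]
    return max(candidates, key=lambda c: min(abs(c - s) for s in seen))

candidates = [i / 10 for i in range(11)]
model, data = active_learning_loop(candidates, oracle, select_query, train, 3)
```

In the OED setting discussed later, `select_query` would be replaced by a rule that scores candidates by expected variance reduction; the loop itself stays the same.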

The goal is to choose examples that minimize the expectation
of the learner’s mean squared error $E_{MSE} =
\langle (f_{\hat w}(x) - y)^T (f_{\hat w}(x) - y) \rangle_X$, where $\langle \cdot \rangle_X$ represents the
expected value over $X$. In contrast to some other learning
paradigms [Valiant, 1984; Blumer et al., 1986], we
will assume that the input distribution $P_X$ is known (footnote 2).
Below, we present several example problems:

Example 1: mapping buried waste. Consider a
mobile sensor array traversing a landscape to map out
subsurface electromagnetic anomalies. Its location at
time $t$ serves as input $x_t$, and the instrument reading
at that location is output $y_t$. At the next time step, it
can choose its new input $\tilde x$ from any location contiguous
to its present position.

Example 2: robot arm dynamics. Consider learning
the dynamics of a robot arm. The input is the
state-action triplet $x_t = \{\Theta_t, \dot\Theta_t, \tau_t\}$, where $\Theta_t$ and $\dot\Theta_t$
are the arm’s joint angles and velocities, respectively,
and $\tau_t$ is the torque applied at time $t$. The output
$y_t = \{\Theta_{t+1}, \dot\Theta_{t+1}\}$ is the resulting state. Note that here,
although we may specify an arbitrary torque $\tau_t$, the rest
of the input, $\{\Theta_t, \dot\Theta_t\}$, is determined by $y_{t-1}$.

We emphasize that while the above problem definition
has wide-ranging application, it is by no means
all-encompassing. For some learning problems, we are not
interested in the entire mapping $X \to Y$, but in finding
the $x$ that maximizes $y$. In this case, we may rely on
the broad literature of optimization and response surface
techniques [Box and Draper, 1969]. In other learning
problems there may be additional constraints that
must be considered, such as the need to avoid “failure”
states. If the learner is required to perform as it learns
(e.g. in a control task), we may also need to balance
exploration and exploitation. Such constraints and costs
may be incorporated into the data selection criterion as
additional costs, but these issues are beyond the scope
of this paper. In this paper we assume that the cost of
the query $\tilde x$ is independent of $\tilde x$, and that the sole aim
of active learning is to minimize $E_{MSE}$.

1.3 Related work with optimal experiment 
design 

The literature on optimal experiment design is immense
and dates back at least 50 years. We will just mention
here a few closely related theoretical results and empirical
studies; the interested reader should consult Atkinson
and Donev [1992] for a survey of results and applications
using optimal experiment design.

A canonical description of the theory of OED is given
in Fedorov [1972]. MacKay [1992] showed that OED
could be incorporated into a Bayesian framework for
neural network data selection and described several interesting
optimization criteria. Sollich [1994] considers
the theoretical generalization performance of linear networks
given greedy vs. globally optimal queries and varying
assumptions on teacher distributions.

Empirically, optimal experiment design techniques
have been successful when used for system identification
tasks. In these cases a good parameterized model
of the system is available, and learning involves finding

2 Both assumptions are reasonable in different situations; 
if we are attempting to learn to control a robot arm, for 
example, it is appropriate to assume that we know over what 
range we wish to control it. 


the proper parameters. Armstrong [1989] used a form
of OED to identify link masses and inertial moments
of a robot arm, and found that automatically generated
training trajectories provided a significant improvement
over human-designed trajectories. Subrahmonia et
al. [1992] successfully used experiment design to guide
exploration of a sensor moving along the surface of an
object parameterized as an unknown quadric.

Empirical work on using OED with neural networks is
sparse. Plutowski and White [1993] successfully used it
to filter an already-labeled data set for maximally informative
points. Choueiki [1994] has successfully trained
neural networks on quadratic surfaces with data drawn
according to the D-optimality criterion, which is discussed
in the appendix.

2 Learning with static and dynamic 
constraints 

As seen in Section 1.2, different problems impose different
constraints on the specification of $\tilde x$. These constraints
may be classified as being either static or dynamic,
and problems with dynamic constraints may be
further divided according to whether or not the dynamics
of the constraints are known a priori. The remainder
of this section elaborates on these distinctions. Examples,
with experimental results for each category, will be
given in Section 4.

2.1 Active learning with static constraints 

When a learner has static input constraints, its range
of choices for $\tilde x$ is fixed, regardless of previous actions.
Examples of problems with static constraints include setting
mixtures of ingredients for an industrial process or
selecting places to take seismic or electromagnetic measurements
to locate buried waste.

The bulk of research on active learning per se has
concentrated on learning with static constraints. In
this setting, active learning algorithms are compared
against algorithms learning from randomly chosen examples.
In general, the number of randomly chosen examples
needed to achieve an expected error of no more
than $\epsilon$ scales as $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ [Blumer et al., 1989; Baum
and Haussler, 1989; Cohn and Tesauro, 1992; Haussler,
1992]. In some situations, active selection of training examples
can reduce the sample complexity to $O(\log\frac{1}{\epsilon})$ (footnote 3),
although worst case bounds for unconstrained querying
are no better than those for choosing at random [Eisenberg
and Rivest, 1990]. Average case analysis indicates
that on many domains the expected performance of active
selection of training examples is significantly better
than that of random sampling [Freund and Seung,
1993]; these results have also been supported by empirical
studies [Cohn et al., 1990; Hwang et al., 1991; Baum
and Lang, 1991].
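The logarithmic query count can be seen in the binary search case that footnote 3 mentions. The sketch below is only an illustration under stated assumptions: the concept is a hypothetical one-dimensional threshold on [0, 1], and `learn_threshold_actively` and `true_threshold` are names invented here.

```python
def learn_threshold_actively(label, eps):
    """Locate a threshold in [0, 1] to within eps by bisection queries.

    label(x) is assumed to return True exactly when x >= threshold.
    """
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if label(mid):
            hi = mid        # threshold lies at or below mid
        else:
            lo = mid        # threshold lies above mid
    return (lo + hi) / 2, queries

true_threshold = 0.3141     # hypothetical target, unknown to the learner
estimate, n_queries = learn_threshold_actively(
    lambda x: x >= true_threshold, eps=1e-3)
```

Each query halves the interval of uncertainty, so the number of queries grows like $\log_2(1/\epsilon)$. A learner that receives uniformly random labeled points needs on the order of $1/\epsilon$ of them before two are likely to bracket the threshold that tightly.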

A limitation of the active learning algorithms mentioned
above is that they are only applicable to specific
active learning problems: the algorithms of Cohn et al.
and Hwang et al. are limited to classification problems,
and Baum and Lang’s algorithm is further limited to a

3 Consider cases where binary search is applicable. 



specific network architecture (single hidden layer with
sigmoidal units). The OED-based approach discussed
in this paper is applicable to any network architecture
whose output is differentiable with respect to its parameters,
and may be used on both regression and classification
problems.
Figure 2: In problems with dynamic constraints, the set
of candidate $\tilde x$ can change after each query. The $\tilde x_{t+1}$
accessible to the learner on the bottom depends on the
choice made for $\tilde x_t$.

2.2 Active learning with dynamic constraints 

In many learners, the constraints on $\tilde x$ are dynamic, and
change over time. Inputs that are available to us on
one time step may no longer be available on the next.
Typically, these constraints represent some state of the
system that is altered by the learner’s actions. Training
examples then describe a trajectory through state space.

In some such problems the dynamics of the constraints
are known, and we may predict a priori what constraints
we will face at time $t$, given an initial state and actions
$x_1, x_2, \ldots, x_t$. Consider Example 1, using a mobile sensor
array to locate buried waste. We can pre-plan the
course the vehicle will take, but its successive measurements
are constrained to lie in the neighborhood of previous
ones. Alternatively, consider learning the forward
kinematics of an arm: we specify joint angles $\Theta$ in an
attempt to predict the arm’s tip coordinates $C$. Barring
any unknown obstacles, we can move from our current
position $\Theta_t$ to a neighboring $\Theta_{t+1}$, but cannot select an
arbitrary $\Theta_{t+1}$ for the next time step.

A more common and more difficult problem is learning
when the dynamics of the constraints are not known
and must be accommodated online. Learning the dynamics
of a robot arm $\{\Theta_t, \dot\Theta_t, \tau_t\} \to \{\Theta_{t+1}, \dot\Theta_{t+1}\}$ is
an example of this type of problem. At each time step
$t$, the model input $\tilde x$ is a state-action pair $\{\Theta_t, \dot\Theta_t, \tau_t\}$,
where $\Theta_t$ and $\dot\Theta_t$ are constrained to be the learner’s current
state. Until the action is selected and taken, the
learner does not know what its new state, and thus its
new constraints, will be (this is in fact exactly what it is
attempting to learn).

In most forms of constrained learning problems, random
exploration is a poor strategy. Taking random actions
leads to a form of “drunkard’s walk” over $X$, which
can require an unacceptably large number of examples
to give good coverage [Whitehead, 1991].
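The coverage problem can be explored with a small simulation. This is a toy sketch under assumptions that are not from the paper: the domain is a discretized 20 x 20 grid, the constrained learner takes unit steps in a random direction, and the unconstrained learner samples cells uniformly. The function names and seeds are hypothetical.

```python
import random

def random_walk_coverage(grid_size, steps, seed=0):
    """Fraction of grid cells visited by a random walk with unit steps."""
    rng = random.Random(seed)
    x = y = grid_size // 2
    visited = {(x, y)}
    for _ in range(steps):
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x = min(max(x + dx, 0), grid_size - 1)    # stay inside the domain
        y = min(max(y + dy, 0), grid_size - 1)
        visited.add((x, y))
    return len(visited) / grid_size ** 2

def uniform_sampling_coverage(grid_size, steps, seed=0):
    """Fraction of cells hit when each query is drawn uniformly at random."""
    rng = random.Random(seed)
    visited = {(rng.randrange(grid_size), rng.randrange(grid_size))
               for _ in range(steps)}
    return len(visited) / grid_size ** 2

walk = random_walk_coverage(grid_size=20, steps=400)
uniform = uniform_sampling_coverage(grid_size=20, steps=400)
```

Comparing `walk` and `uniform` for growing `steps` gives a rough feel for how much coverage the movement constraint costs a randomly exploring learner.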

In cases where the dynamics of the constraints are
known a priori, we can plan a trajectory that will uniformly
cover $X$ in some prespecified number of steps. In
general, though, we will have to resort to some online
process to decide “what to try next.” Some successful
heuristic exploration strategies include trying to visit
unvisited states [Schaal and Atkeson, 1994], trying to
visit places where we perform poorly [Linden and Weber,
1993], taking actions that improved our performance
in similar situations [Schmidhuber and Storck, 1993], or
maintaining a heuristic “confidence map” [Thrun and
Moller, 1992]. Some researchers, in cases where the exploration
is considered a secondary problem, provide the
learner with a uniformly distributed training set, in effect
assuming the problem allows unconstrained querying
(e.g. Mel [1992]).

An important limitation of the above work with dynamic
constraints is that, for the most part, the methods
are restricted to discrete state spaces. Continuous state
and action spaces must be accommodated either through
arbitrary discretization or through some form of on-line
partitioning strategy, such as Moore’s Parti-Game algorithm
[Moore, 1994]. The OED-based approach discussed
in this paper is, by nature, applicable to domains
with both continuous state and action spaces.

3 Data selection according to OED 

In this section, we review the theory of optimal experiment
design applied to neural network learning. As
stated in the introduction, our primary goal is to minimize
$E_{MSE}$. An alternative goal of system identification
is discussed briefly in the appendix, and other interesting
goals, such as eigenvalue maximization and entropy minimization,
may be found in Fedorov [1972] and MacKay
[1992].

Error minimization is pursued in the OED framework
by selecting data to minimize model uncertainty. Uncertainty
in this case is manifested as the learner’s estimated
output variance $\sigma^2_{\hat y}$. The justification for selecting data
to minimize variance comes from the nature of MSE.





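Defining $\bar y|x = \langle \hat y|x \rangle_Y$, mean squared error may be decoupled
into variance and bias terms:

$$
\begin{aligned}
E_{MSE} &= \langle (\hat y|x - y|x)^2 \rangle_X \\
        &= \langle (\hat y|x - \bar y|x)^2 \rangle_X + \langle (\bar y|x - y|x)^2 \rangle_X \\
        &= \sigma^2_{\hat y} + \langle (\bar y|x - y|x)^2 \rangle_X .
\end{aligned}
$$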

Given an unbiased estimator, or an estimator for which
either the bias is small or independent of the training
set, error minimization amounts to minimizing the variance
of the estimator. For the rest of our computations
we will neglect the bias term, and select data solely to
minimize the estimated variance of our learner (footnote 4). An
in-depth discussion of bias/variance tradeoff may be found
in Geman et al. [1992].
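The decomposition of error into variance and bias can be checked numerically at a single input. A minimal sketch, assuming a hypothetical list of predictions that the same learner would make at one point after training on different data sets; the numbers and the name `decompose_mse` are illustrative only.

```python
def decompose_mse(predictions, target):
    """Split mean squared error at one input into variance and bias^2.

    predictions: outputs of the learner at a fixed x, one per training set.
    target: the true output y at that x.
    """
    n = len(predictions)
    mean_pred = sum(predictions) / n
    mse = sum((p - target) ** 2 for p in predictions) / n
    variance = sum((p - mean_pred) ** 2 for p in predictions) / n
    bias_sq = (mean_pred - target) ** 2
    return mse, variance, bias_sq

mse, variance, bias_sq = decompose_mse([0.9, 1.1, 1.4, 1.0], target=1.0)
```

The cross term vanishes because deviations from the mean prediction average to zero, so `mse` equals `variance + bias_sq` up to rounding. Selecting data to shrink `variance` alone leaves `bias_sq` untouched, which is the limitation that neglecting the bias term accepts.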


3.1 Estimating variance 


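Estimates for $\sigma^2_{\hat y}$ may be obtained by adopting techniques
derived for linear systems. We write the network’s
output sensitivity as $g(x) = \partial \hat y|x / \partial w = \partial f_{\hat w}(x) / \partial w$, and
define the Fisher Information Matrix to be

$$
\begin{aligned}
A &= \frac{1}{S^2} \frac{\partial^2 S^2}{\partial w^2} \\
  &= \frac{1}{S^2} \sum_{i=1}^{m} \left[ \frac{\partial \hat y|x_i}{\partial w} \frac{\partial \hat y|x_i}{\partial w}^T
     + (\hat y|x_i - y_i) \frac{\partial^2 \hat y|x_i}{\partial w^2} \right] \\
  &\approx \frac{1}{S^2} \sum_{i=1}^{m} g(x_i) g(x_i)^T . \qquad (1)
\end{aligned}
$$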


The approximation in Equation 1 holds when the network
fits the data well or the error surface has relatively
constant curvature in the vicinity of $\hat w$. We may then
write the parameter covariance matrix as $\sigma^2_{\hat w} = A^{-1}$ and
the output variance at reference input $x_r$ as

$$\sigma^2_{\hat y|x_r} \approx g(x_r)^T A^{-1} g(x_r) \qquad (2)$$

subject to the same approximations (see Thisted [1988]
for derivations) (footnote 5). Note that the estimate $\sigma^2_{\hat y|x_r}$ applies
only to the variance at a particular reference point. Our
interest is in estimating $\sigma^2_{\hat y}$, the average variance over all
of $X$. We do not have a method for directly integrating
over $X$, and instead opt for a stochastic estimate based
on an average of $\sigma^2_{\hat y|x_r}$, with $x_r$ drawn according to $P_X$.
Writing the first and second moments of $g$ as $\bar g = \langle g(x) \rangle_X$
and $\overline{gg^T} = \langle g(x) g(x)^T \rangle_X$, this estimate can be computed
efficiently as

$$\langle \sigma^2_{\hat y|x_r} \rangle_X = \bar g^T A^{-1} \bar g + \mathrm{tr}(A^{-1} \overline{gg^T}), \qquad (3)$$

where $\mathrm{tr}(\cdot)$ is the matrix trace. Instead of recomputing
Equation 2 for each reference point, $\bar g$ and $\overline{gg^T}$ may
be computed over the reference points, and Equation 3
evaluated once.
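The saving from accumulating moments once can be illustrated numerically. The sketch below makes several assumptions: the sensitivity vectors in `refs` and the matrix `A_inv` are made-up values, and the trace term in the split form is taken over the centered second moment (the covariance of $g$). That reading of Equation 3 is an assumption of this sketch; with the raw second moment, the single trace $\mathrm{tr}(A^{-1}\langle g g^T \rangle_X)$ already equals the average of Equation 2.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def trace_of_product(P, Q):
    """tr(P Q) for square matrices stored as nested lists."""
    n = len(P)
    return sum(P[i][k] * Q[k][i] for i in range(n) for k in range(n))

def pointwise_variance(g, A_inv):
    """Equation 2: g^T A^{-1} g at a single reference point."""
    return dot(g, mat_vec(A_inv, g))

# Hypothetical sensitivities g(x_r) at four reference points, and A^{-1}.
refs = [[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3], [0.4, 0.9]]
A_inv = [[0.5, 0.1], [0.1, 0.25]]
r, n = len(refs), 2

g_bar = [sum(g[i] for g in refs) / r for i in range(n)]
raw = [[sum(g[i] * g[j] for g in refs) / r for j in range(n)]
       for i in range(n)]
centered = [[raw[i][j] - g_bar[i] * g_bar[j] for j in range(n)]
            for i in range(n)]

# Three routes to the same average output variance.
average_direct = sum(pointwise_variance(g, A_inv) for g in refs) / r
average_raw = trace_of_product(A_inv, raw)
average_split = (pointwise_variance(g_bar, A_inv)
                 + trace_of_product(A_inv, centered))
```

The direct route costs one quadratic form per reference point for every new $A^{-1}$, while the moment-based routes need only the accumulated moments and a single evaluation.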

4 The bias term will in fact reappear as a limiting factor 
in the experimental results described in Section 4.2. 

5 If the inverse $A^{-1}$ does not exist, then the parameter
covariances are not well-defined. In practice, one could use
the pseudo-inverse, but the need to do this arose very rarely in
our experiments, even with small training sets.


3.2 Quantifying change in variance 

When an input $\tilde x$ is queried, we obtain the resulting output
$\tilde y$. When the new example $(\tilde x, \tilde y)$ is added to the
training set, the variance of the model will change. We
wish to select $\tilde x$ optimally, such that the resulting variance,
denoted $\tilde\sigma^2_{\hat y}$, is minimized.

The network provides a (possibly inaccurate) estimate
of the distribution $P(\tilde y|\tilde x)$, embodied in an estimate of
the mean ($\hat y|\tilde x$) and variance ($S^2$). Given infinite computational
power, then, we could use these estimates to
stochastically approximate $\tilde\sigma^2_{\hat y}$ by drawing examples according
to our estimate of $P(\tilde y|\tilde x)$, training on them, and
averaging the new variances. In practice, though, we
must settle for a coarser approximation. Note that the
approximation in Equation 1 is independent of the actual
$y_i$ values of the training set; the dependence is implicit in
the choice of $\hat w$ that minimizes $S^2$. If $P(\tilde y|\tilde x)$ conforms
to our expectations, $\hat w$ and $g(\cdot)$ will remain essentially
unchanged, allowing us to compute the new information
matrix $\tilde A$ as

$$\tilde A \approx A + \frac{1}{S^2} g(\tilde x) g(\tilde x)^T . \qquad (4)$$

From the new information matrix, we may compute the
new parameter variances, and from there, the new output
variances. By the matrix inversion lemma,

$$
\tilde A^{-1} = \left( A + \frac{1}{S^2} g(\tilde x) g(\tilde x)^T \right)^{-1}
= A^{-1} - \frac{A^{-1} g(\tilde x) g(\tilde x)^T A^{-1}}{S^2 + g(\tilde x)^T A^{-1} g(\tilde x)} .
$$
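The rank-one update given by the matrix inversion lemma can be verified on a small case. This sketch assumes a made-up symmetric 2 x 2 information matrix `A`, sensitivity vector `g`, and residual `s2`; the helper names are hypothetical.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_one_update(A_inv, g, s2):
    """Inverse of (A + g g^T / s2), computed from A_inv without re-inverting.

    A_inv is assumed symmetric, so A^{-1} g g^T A^{-1} = (A^{-1} g)(A^{-1} g)^T.
    """
    Ag = mat_vec(A_inv, g)
    denom = s2 + dot(g, Ag)
    n = len(g)
    return [[A_inv[i][j] - Ag[i] * Ag[j] / denom for j in range(n)]
            for i in range(n)]

def inverse_2x2(M):
    """Closed-form inverse, used here only to check the update."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[4.0, 1.0], [1.0, 3.0]]      # hypothetical information matrix
g = [1.0, 2.0]                    # hypothetical sensitivity g(x) at the query
s2 = 0.5                          # hypothetical residual S^2

updated = rank_one_update(inverse_2x2(A), g, s2)
direct = inverse_2x2([[A[i][j] + g[i] * g[j] / s2 for j in range(2)]
                      for i in range(2)])
```

The update costs $O(n^2)$ per candidate instead of the $O(n^3)$ of a fresh inversion, which matters when many candidate queries are scored against the same $A^{-1}$.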

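The utility of querying at $\tilde x$ may be expressed in terms
of the expected change in the estimated output variance
$\sigma^2_{\hat y}$. The expected new output variance at reference point
$x_r$ is

$$
\begin{aligned}
\tilde\sigma^2_{\hat y|x_r} &= g(x_r)^T \tilde A^{-1} g(x_r) \\
 &= g(x_r)^T \left[ A^{-1} - \frac{A^{-1} g(\tilde x) g(\tilde x)^T A^{-1}}{S^2 + g(\tilde x)^T A^{-1} g(\tilde x)} \right] g(x_r) \\
 &= g(x_r)^T A^{-1} g(x_r) - \frac{\left[ g(x_r)^T A^{-1} g(\tilde x) \right]^2}{S^2 + g(\tilde x)^T A^{-1} g(\tilde x)} \\
 &= \sigma^2_{\hat y|x_r} - \frac{\sigma^2_{\hat y|x_r \tilde x}}{S^2 + \sigma^2_{\hat y|\tilde x}} ,
\end{aligned}
$$

where $\sigma_{\hat y|x_r \tilde x}$ is defined as $g(x_r)^T A^{-1} g(\tilde x)$. Thus, when
$\tilde x$ is queried, the expected change in output variance at
$x_r$ is

$$\langle \Delta\sigma^2_{\hat y|x_r} \rangle_Y \,|\, \tilde x = \frac{\sigma^2_{\hat y|x_r \tilde x}}{S^2 + \sigma^2_{\hat y|\tilde x}} . \qquad (6)$$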


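We compute $\langle \Delta\sigma^2_{\hat y} \rangle_{X,Y} \,|\, \tilde x$ as a stochastic approximation
from $\langle \Delta\sigma^2_{\hat y|x_r} \rangle_Y \,|\, \tilde x$ for $x_r$ drawn from $P_X$. Reusing the
estimate $\overline{gg^T}$ from the previous section, we can write
the expectation of Equation 6 over $X$ as

$$\langle \Delta\sigma^2_{\hat y} \rangle_{X,Y} \,|\, \tilde x = \frac{g(\tilde x)^T A^{-1} \overline{gg^T} A^{-1} g(\tilde x)}{S^2 + g(\tilde x)^T A^{-1} g(\tilde x)} . \qquad (7)$$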
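Equation 7 can be checked against a direct average of Equation 6 over reference points. The sketch below is illustrative only: the reference sensitivities, `A_inv`, `g_query`, and `s2` are made-up values, `A_inv` is assumed symmetric, and the function names are hypothetical.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def second_moment(gradients):
    """Raw second moment <g g^T> over the reference sensitivities."""
    n, r = len(gradients[0]), len(gradients)
    return [[sum(g[i] * g[j] for g in gradients) / r for j in range(n)]
            for i in range(n)]

def expected_variance_reduction(g_query, A_inv, gg_moment, s2):
    """Equation 7: g^T A^{-1} <gg^T> A^{-1} g / (S^2 + g^T A^{-1} g)."""
    Ag = mat_vec(A_inv, g_query)             # A^{-1} g for the candidate
    numerator = dot(Ag, mat_vec(gg_moment, Ag))
    return numerator / (s2 + dot(g_query, Ag))

refs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical g(x_r)
A_inv = [[0.5, 0.1], [0.1, 0.25]]            # hypothetical A^{-1}
g_query = [1.0, 2.0]                         # hypothetical g at the candidate
s2 = 0.5

gain = expected_variance_reduction(g_query, A_inv, second_moment(refs), s2)

# Direct route: average Equation 6 over the reference points.
Ag = mat_vec(A_inv, g_query)
direct = (sum(dot(g, Ag) ** 2 for g in refs) / len(refs)) / (s2 + dot(g_query, Ag))
```

Because the second moment is accumulated once, scoring another candidate costs a few matrix-vector products regardless of how many reference points were used.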





3.3 Selecting an optimal x 

Given Equation 7, the problem remains of how to select
an $\tilde x$ that maximizes it. One approach to selecting a next
input is to use selective sampling: evaluate a number of
possible random $\tilde x$, then choose the one with the highest
expected gain. This is efficient so long as the dimension
of the action space is small. For high-dimensional problems,
we may use gradient ascent to efficiently find good
$\tilde x$. Differentiating Equation 7 with respect to $\tilde x$ gives a
gradient


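$$
\nabla_{\tilde x} \langle \Delta\sigma^2_{\hat y} \rangle_{X,Y}
= 2\, \frac{\left( S^2 + g(\tilde x)^T A^{-1} g(\tilde x) \right) g(\tilde x)^T A^{-1} \overline{gg^T} A^{-1}
 - \left( g(\tilde x)^T A^{-1} \overline{gg^T} A^{-1} g(\tilde x) \right) g(\tilde x)^T A^{-1}}
 {\left( S^2 + g(\tilde x)^T A^{-1} g(\tilde x) \right)^2}\,
 \frac{\partial g(\tilde x)}{\partial \tilde x} . \qquad (8)
$$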


We can “hillclimb” on this gradient to find an $\tilde x$ with a
locally optimal expected change in average output variance.
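The selective sampling variant described above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the toy quadratic model whose sensitivity is $g(x) = (1, x, x^2)$, the identity placeholder for $A^{-1}$, the grid of reference points, and all function names.

```python
import random

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def expected_gain(g, A_inv, M, s2):
    """Equation 7 for one candidate with sensitivity vector g."""
    Ag = mat_vec(A_inv, g)                   # A_inv assumed symmetric
    return dot(Ag, mat_vec(M, Ag)) / (s2 + dot(g, Ag))

def sensitivity(x):
    """Toy sensitivity of a quadratic model y = w0 + w1*x + w2*x^2."""
    return [1.0, x, x * x]

def draw_candidates(n, seed):
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def select_by_sampling(candidates, A_inv, M, s2):
    """Return the candidate with the highest expected variance reduction."""
    return max(candidates,
               key=lambda x: expected_gain(sensitivity(x), A_inv, M, s2))

# Reference points on a grid stand in for draws from the input distribution.
refs = [sensitivity(-1.0 + 0.1 * i) for i in range(21)]
M = [[sum(g[i] * g[j] for g in refs) / len(refs) for j in range(3)]
     for i in range(3)]
A_inv = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # placeholder
candidates = draw_candidates(50, seed=0)
best_x = select_by_sampling(candidates, A_inv, M, s2=0.1)
```

For a neural network, `sensitivity` would be the gradient of the output with respect to the weights, and the gradient-based alternative would start from `best_x` or a random point and follow Equation 8 instead of scoring a fixed candidate set.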

It is worth noting that both of these approaches are
applicable in continuous domains, and therefore well-suited
to problems with continuous action spaces. Furthermore,
the gradient approach is effectively immune
to the overabundance of candidate actions in high-dimensional
action spaces.


3.4 A caveat: greedy optimality 

We have described a criterion for one-step, or greedy,
optimization. That is, each action/query is chosen to maximize
the change in variance on the next step, without
regard to how future queries will be chosen. The globally
optimal, but computationally expensive, approach
would involve optimizing over an entire trajectory of $m$
actions/queries. Trajectory optimization entails starting
with an initial trajectory, computing the expected gain
over it, and iteratively relaxing points on the trajectory
towards optimal expected gains (subject to other points
along the trajectory being explored). After the iteration
has settled, the first point in the trajectory is queried,
and the relaxation is repeated on the remaining part of
the trajectory. Experiments using this form of optimization
did not demonstrate measurable improvement, in
the average case, over the greedy method, so it appears
that trajectory optimization may not be worth the additional
computational expense, except in extreme situations
(see Sollich [1994] for a theoretical comparison of
greedy and globally-optimal querying).


4 Experimental Results 

In this section, we describe two sets of experiments using
optimal experiment design for error minimization.
The first attempts to confirm that the gains predicted
by optimal experiment design may actually be realized in
practice, and the second applies OED to learning tasks
with static and dynamic constraints. All experiments
described in this section were run using feedforward networks
with a single hidden layer of 20 units. Hidden and
output units used the 0-1 sigmoid as a nonlinearity. All
runs were performed on the Xerion simulator [van Camp
et al., 1993] using the default weight update rule (“Rudi’s
Conjugate Gradient” with “Ray’s Line Search”) with no
weight decay term.


4.1 Expected versus actual gain 

It must be emphasized that the gains predicted by OED
are expected gains. These expectations are based on the
series of approximations detailed in the previous section,
which may compromise the realization of any actual
gain. In order for the expected gains to materialize, two
“bridges” must be crossed. First, the expected decrease
in model variance must be realized as an actual decrease
in variance. Second, the actual decrease in model variance
must translate into an actual decrease in model
MSE.
Figure 3: (top) Correlations between expected change
in output variance and actual change in output variance.
(bottom) Correlations between actual change in output
variance and change in mean squared error. Correlations
are plotted for a network with a single hidden layer of 20
units trained on 50 examples from the arm kinematics
task.
4.1.1 Expected decreases in variance → actual
decreases in variance

The translation from expected to actual changes in
variance requires coordination between the exploration
strategy and the learning algorithm: to predict how the
variance of a weight will change with a new piece of data,
the predictor must know how the weight itself (and its
neighboring weights) will change. Using a black box routine
like backpropagation to update the weights virtually
guarantees that there will be some mismatch between
expected and actual decreases in variance. Experiments
indicate that, in spite of this, the correlation between
predicted and actual changes in variance is relatively
good (Figure 3a).

4.1.2 Decreases in variance → decreases in
MSE

A more troubling translation is the one from model
variance to model correctness. Given the highly nonlinear
nature of a neural network, local minima may leave
us in situations where the model is very confident but
entirely wrong. Due to high confidence, the learner may
reject actions that would reduce its mean squared error
and explore areas where the model is correct, but has low
confidence. Evidence of this behavior is seen in the lower
right corner of Figure 3b, where some actions which produce
a large decrease in variance have little effect on
$E_{MSE}$. This behavior appears to be a manifestation of
the bias term discussed in Section 3; these queries reduce
variance while increasing the learner’s bias, with no net
decrease in error. While this demonstrates a weak point
in the OED approach (which will be further illustrated
below), we find in the remainder of this section that its
effect is negligible for many classes of problems.

4.2 Querying with static constraints 

Here we consider a simple learning problem with static
constraints: learning the forward kinematics of a planar
arm from examples. The input $X = \{\Theta_1, \Theta_2\}$ specified
the arm’s joint angles, and the learner attempted
to learn a map from these to the Cartesian coordinates
$Y = \{C_1, C_2\}$ of the arm’s tip. The “shoulder” and “elbow”
joints were constrained to the ranges 0° to 360° and 0° to 180°,
respectively; on each time step the learner was allowed
to specify an arbitrary $\tilde x \in X$ within those limits.
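The target mapping for this task is the standard forward kinematics of a planar two-link arm. A minimal sketch, assuming unit link lengths and an elbow angle measured relative to the first link (neither convention is stated in the text); the function names are hypothetical.

```python
import math

def forward_kinematics(theta1_deg, theta2_deg, l1=1.0, l2=1.0):
    """Tip coordinates (C1, C2) of a planar two-link arm.

    theta1_deg: shoulder angle; theta2_deg: elbow angle relative to link 1.
    l1, l2: link lengths (assumed values, not given in the text).
    """
    t1 = math.radians(theta1_deg)
    t2 = math.radians(theta2_deg)
    c1 = l1 * math.cos(t1) + l2 * math.cos(t1 + t2)
    c2 = l1 * math.sin(t1) + l2 * math.sin(t1 + t2)
    return c1, c2

def in_static_limits(theta1_deg, theta2_deg):
    """Static constraints: shoulder in [0, 360], elbow in [0, 180] degrees."""
    return 0.0 <= theta1_deg <= 360.0 and 0.0 <= theta2_deg <= 180.0

tip = forward_kinematics(90.0, 0.0)   # arm pointing straight along the C2 axis
```

With static constraints, any pair of angles passing `in_static_limits` may be queried on any step, independent of earlier queries.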

For the greedy OED learner, $\tilde x$ was chosen by beginning
at a random point in $X$ and hillclimbing the gradient
of Equation 6 to a local maximum before querying.
This strategy was compared with simply choosing $\tilde x$ at
random, and choosing $\tilde x$ according to a uniform grid over
$X$ (footnote 6).

We compared the variance and MSE of the OED-based
learner with that of the random and grid learners.
The average variance of the OED-based learner was
almost identical to that of the grid learner and slightly
better than that of the random learner (Figure 4, top). In

6 Note the uniform grid strategy is not viable for incrementally
drawn training sets - the size of the grid must be
fixed before any examples are drawn. In these experiments,
entirely new training sets of the appropriate size were drawn
for each new grid.


terms of MSE, however, the greedy OED learner did not
fare as well. Its error was initially comparable to that of
the grid strategy, but flattened out at an error approximately
twice that of the asymptotic limit (Figure 4, bottom).
This flattening appears to be a result of bias. As discussed
in Section 3, the network’s error is composed of a
variance term and a bias term, and the OED-based approach,
while minimizing variance, appears (in this case)
to leave a significant amount of bias.
Figure 4: Querying with static constraints to learn the
kinematics of a planar two-joint arm. (top) Variance
using OED-based actions is better than that using random
queries, and matches the variance of a uniform
grid. (bottom) MSE using OED-based actions is initially
very good, but breaks down at larger training set
sizes. Curves are averages over six runs apiece for OED
and grid learners, and 12 runs for the random learner.



4.3 Querying with dynamic constraints 


For learning with dynamic constraints, we again used the
planar arm problem, but this time with a more realistic
restriction on new inputs. For the first series of experiments,
the learner learned the kinematics by incrementally
adjusting $\Theta_1$ and $\Theta_2$ from their values on the previous
query. The limits of allowable movement on each
step corresponded to constraints with known dynamics.
The second set of experiments involved learning the dynamics
of the same arm based on torque commands. The
unknown next state of the arm corresponded to constraints
with unknown dynamics.

4.3.1 Constraints with known dynamics 

To learn the arm kinematics, the learner hillclimbed
to find the $\Theta_1$ and $\Theta_2$ within its limits of movement that
would maximize the stochastic approximation of the expected
change in variance (Equation 7).
On each time step $\Theta_1$ and $\Theta_2$ were limited to change by
no more than ±36° and ±18° respectively.
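The per-step movement limits can be enforced by projecting each proposed query onto the reachable set before it is taken. This is a hedged sketch: the ±36° and ±18° step limits and the joint ranges come from the text, but the function name, the order of the two clipping operations, and the idea of clipping a hillclimbing proposal (rather than constraining the search itself) are assumptions.

```python
def clamp_step(current, proposed, max_step=(36.0, 18.0),
               limits=((0.0, 360.0), (0.0, 180.0))):
    """Project a proposed joint configuration onto the reachable set.

    current, proposed: (theta1, theta2) in degrees.
    max_step: largest allowed change per time step for each joint.
    limits: fixed joint ranges.
    """
    result = []
    for cur, new, step, (lo, hi) in zip(current, proposed, max_step, limits):
        new = min(max(new, cur - step), cur + step)   # per-step movement limit
        new = min(max(new, lo), hi)                   # fixed joint range
        result.append(new)
    return tuple(result)

next_query = clamp_step(current=(100.0, 90.0), proposed=(200.0, 0.0))
```

Here `next_query` is the closest reachable configuration to the proposal, so a trajectory built from repeated calls always respects the known dynamics of the constraints.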

We compared variance and MSE of the OED-based
learner with that of an identical learner which explored
randomly by “flailing,” and with a learner trained on a
series of hand-tuned trajectories.

The greedy OED-based learner found exploration trajectories
that, intuitively, appear to give good global coverage
of the domain (see Figure 5). In terms of performance,
the average variance over the OED-based trajectories
was almost as good as that of the best hand-tuned
trajectory, and both were far better than that of the
random exploration trajectories. In terms of MSE, the
average error over OED-based trajectories was almost
as good as that of the best hand-tuned trajectory, and
again, both were far better than the random exploration
trajectories (Figure 6). Note that in this case, bias does
not seem to play a significant role. We discuss the performance
and computational complexity of this task in
greater detail in Section 5.



Figure 5: Querying with dynamic constraints: learning
2D arm kinematics. Example of OED-based learner’s
trajectory through angle-space.


4.3.2 Constraints with unknown dynamics 

For this set of experiments, we once again used
the planar two-jointed arm, but now attempted to
Figure 6: Querying with dynamic constraints: learning
2D arm kinematics. (top) Variance using greedy OED
actions is better than that using random exploration,
and matches the variance of the best hand-tuned trajectory.
(bottom) MSE using greedy OED-based exploration
is much better than that of random exploration
and almost as good as that of the best hand-tuned trajectory.
Curves are averages over 5 runs apiece for OED-based
and random exploration.


learn the arm dynamics. The learner’s input $X =
\{\Theta_1, \Theta_2, \dot\Theta_1, \dot\Theta_2, \tau_1, \tau_2\}$ specified the joint positions,
velocities and torques. Based on these, the learner
attempted to learn the arm’s next state $Y =
\{\Theta_1, \Theta_2, \dot\Theta_1, \dot\Theta_2\}$. As with the kinematics experiment,
we compared random exploration with the greedy OED
strategy described in the previous section. Without
knowing the dynamics of the input constraints, however,
we do not have the ability to specify a preset trajectory.

The performance of the learner whose exploration was
guided by OED was asymptotically much better than
that of the learner following a random search strategy
(Figure 7). It is instructive to notice, however, that this
improvement is not immediate, but appears only after
the learner has taken a number of steps.¹ Intuitively,
this may be explained by the assumptions made in the
OED formalism: the network uses its estimate of the
variance of the current model to determine what data will
minimize the variance. Until there is enough data for the
model to become reasonably accurate, the estimates will
be correspondingly inaccurate, and the search for “optimal”
data will be misled. It would be useful to have a
way of determining at what point the learner’s estimates
become reliable, so that one could explore randomly at
first, then switch to OED-guided exploration when the
learner’s model is accurate enough to take advantage of
it.

¹ This behavior is visible in the other problem domains as
well, but is not as pronounced.


Figure 7: MSE of forward dynamic model for two-joint
planar arm.


5 Computational costs and approximations

The major concern with applying the OED techniques
described in this paper is computational cost. In this
section we consider the computational complexity of
selecting actions via OED techniques, and consider several
approximations aimed at reducing the computational costs.
These costs are summarized in Table 1, with the time
constants observed for runs performed on a Sparc 10.

Operation                                                     Constant
Batch train
Incremental train
Compute exact $A$
Compute approx. $A$
Invert to get $A^{-1}$
Compute $\mathrm{var}(x_r)$
Compute $E[\Delta\,\mathrm{var}(x_r) \mid \tilde{x}]$         5.4*10^…
Compute gradient                                              1.9*10^…

Table 1: Typical compute times, in seconds, for
operations involved in selecting new data and training.
Number of weights in network = n, number of training
examples = m, and number of reference points (at which
variance or gradient is measured) = r. Time constants
are for runs performed on a Sparc 10 using the Xerion
simulator.

We divide the learning process into three steps:
training, variance estimation, and data selection. We show
that, for the case examined, in spite of increased
complexity, the improvement in performance more than
warrants the use of OED for data selection.

Cost of training: Two training regimens were tested
for the OED-guided learners: batch training reinitialized
after each new example was added, and incremental
training, reusing the previous network’s weights after
each new example. While the batch-trained learners’
performance was slightly better, their total training time
was significantly longer than that of their incrementally
trained counterparts (Figure 8).



Cost of variance estimation: (Equation 3) Variance
estimation requires computing and inverting the
Hessian. The inverse Hessian may then be used for an
arbitrary number of variance estimates and must only be
recomputed when the network weights are updated. The
approximate Hessian of Equation 1 may be computed in
time $O(mn^2)$, but the major cost remains the inversion.
We have experimented with diagonal and block diagonal
Hessians, which may be inverted quickly, but without
the off-diagonal terms, the learner failed to generate
reasonable training sets. Recent work by Pearlmutter
[1994] offers a way to bring down the cost of computing the
first term of Equation 3, but computing the second term
remains an $O(n^3)$ operation.
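
The cost structure described above can be illustrated with a small NumPy sketch. Equations 1 and 3 are not reproduced in this section, so the sketch assumes an outer-product form for the approximate Hessian, $A \approx \frac{1}{S^2} \sum_i g_i g_i^T$, and a variance estimate of the form $g^T A^{-1} g$. The ridge term, the toy sizes, and all function names are illustrative assumptions, not part of the original system.

```python
import numpy as np

def approx_hessian(grads, noise_var, ridge=1e-6):
    """Outer-product Hessian approximation; O(m n^2) for m examples, n weights.

    grads: (m, n) array whose row i holds d y_hat(x_i) / d w (assumed given).
    ridge: small diagonal term (an assumption here) that keeps A invertible.
    """
    n = grads.shape[1]
    return grads.T @ grads / noise_var + ridge * np.eye(n)

def output_variance(a_inv, g):
    """Variance estimate g^T A^{-1} g at one point; O(n^2) per estimate."""
    return float(g @ a_inv @ g)

rng = np.random.default_rng(0)
m, n = 50, 8                      # toy sizes, not the paper's networks
grads = rng.normal(size=(m, n))   # stand-in for network output gradients
A = approx_hessian(grads, noise_var=0.1)
A_inv = np.linalg.inv(A)          # the O(n^3) step, redone only when weights change

# The same inverse is reused for any number of reference points.
ref_grads = rng.normal(size=(5, n))
variances = [output_variance(A_inv, g) for g in ref_grads]
```

The point of the sketch is only the division of labor: one expensive inversion per weight update, followed by cheap quadratic forms for each variance estimate.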

Cost of data selection: (Equations 6, 7 and 8)
Computing Equation 6 is an $O(n^2)$ operation, which
must be performed on each of r reference points, and
must be repeated for each candidate $\tilde{x}$. Alternatively,
the “moment-based” selection (Equation 7) and gradient
methods (Equation 8) both require an $O(n^3)$ matrix
multiplication which must be done once, after which any
number of iterations may be performed with new $\tilde{x}$ in
time $O(n^2)$. Using Pearlmutter’s approach to directly
approximate $A^{-1} g(\tilde{x})$ would allow an approximation of
Equation 7 to be computed in $O(n^2)$ times an “accuracy”
constant. We have not yet determined what effect
this time/accuracy tradeoff has on network performance.
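
As a rough illustration of the per-candidate cost, the sketch below scores candidate queries by the variance expected at a set of reference points after a hypothetical query. It assumes the new example enters the Hessian as a rank-one term $g g^T / S^2$, so that the Sherman-Morrison identity gives the updated variance without re-inverting $A$. This may differ in detail from Equation 6, and the function names and toy data are assumptions.

```python
import numpy as np

def expected_variance_after_query(a_inv, g_refs, g_cand, noise_var):
    """Variance at each reference point if the candidate were queried.

    Applies the Sherman-Morrison identity to A + g g^T / S^2, so no new
    inversion is needed; the work is O(n^2) per reference point.
    """
    u = a_inv @ g_cand                       # O(n^2), once per candidate
    var_cand = g_cand @ u                    # variance at the candidate itself
    var_refs = np.einsum("ri,ij,rj->r", g_refs, a_inv, g_refs)
    cov = g_refs @ u                         # covariance with each reference point
    return var_refs - cov**2 / (noise_var + var_cand)

def pick_query(a_inv, g_refs, g_cands, noise_var):
    """Index of the candidate with the lowest mean expected variance."""
    scores = [expected_variance_after_query(a_inv, g_refs, g, noise_var).mean()
              for g in g_cands]
    return int(np.argmin(scores))

rng = np.random.default_rng(1)
n, noise_var = 6, 0.2
train = rng.normal(size=(30, n))             # stand-in gradients of past examples
A = train.T @ train / noise_var
A_inv = np.linalg.inv(A)
g_refs = rng.normal(size=(4, n))
g_cands = rng.normal(size=(10, n))
best = pick_query(A_inv, g_refs, g_cands, noise_var)
```

Because the update depends only on gradients and not on the observed output, every candidate can be scored before any query is actually made.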

Figure 8: Learning curves for the kinematics problem
from Section 4.2. Best-fit functional forms are plotted
for random exploration, incrementally-trained OED and
OED completely retrained on new data set.


The payoff: cost vs. performance. Obviously, the
OED-based approach requires significantly more
computation time than does learning from random examples.
The payoff comes when relative performance is
considered. We turn again to the kinematics problem discussed
in Section 4.3.1. The approximate total time involved in
training a learner on 100 random training examples from
this problem (as computed from Table 1) is 170 seconds.
For “full-blown” OED, using incremental training, the
total time is 790 seconds. As shown in Figure 8,
exploring randomly causes our MSE to decrease roughly
as an inverse polynomial, while the various OED
strategies decrease MSE roughly exponentially in the number
of examples. To achieve the MSE reached by training on
OED-selected data, we would need to train on
approximately 3380 randomly selected data examples. This
would take approximately 7500 seconds, over two hours!
With this much data, the training time alone is greater
than the total OED costs, so regardless of data costs,
selecting data via OED is the preferable approach.
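
The comparison above is simple arithmetic on the quoted figures, which can be laid out explicitly. The sketch below uses only numbers stated in the text. It assumes the 790-second OED run also covers 100 examples, which the text implies but does not state outright, and the variable names are illustrative.

```python
# Figures quoted in the text (seconds); nothing here is newly measured.
random_100_s = 170        # training on 100 random examples
oed_100_s = 790           # "full-blown" OED, incremental training (assumed 100 examples)
random_matched_n = 3380   # random examples needed to match the OED learner's MSE
random_matched_s = 7500   # training time for that many random examples

hours = random_matched_s / 3600                # time for random to catch up
compute_ratio = random_matched_s / oed_100_s   # random vs. OED at matched MSE
overhead = oed_100_s / random_100_s            # OED vs. random at equal data
data_ratio = random_matched_n / 100            # data needed, random vs. OED

print(f"random exploration at matched MSE: {hours:.2f} hours")
print(f"compute ratio (random / OED) at matched MSE: {compute_ratio:.1f}")
print(f"OED overhead at 100 examples: {overhead:.1f}x")
print(f"data ratio (random / OED): {data_ratio:.1f}")
```

At equal data OED costs several times more than random exploration, but at equal error the ordering reverses, which is the sense in which the extra computation pays for itself.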

With the kinematics example there is the option of 
hand-tuning a learning trajectory, which requires no 
more data than the OED approach, and can nominally 
be learned in less time. This, however, required hours of 
human intervention to repeatedly re-run the simulations 
trying different preset exploration trajectories. In the 
dynamics example and in other cases where the state 
transitions are unknown, preset exploration strategies 
are not an option; we must rely on an algorithm for 
deciding our next action, and the OED-based strategy
appears to be a viable, statistically well-founded choice. 


6 Conclusions and Future Work 


The experiments described in this paper indicate that, 
for some tasks, optimal experiment design is a promis¬ 
ing tool for guiding active learning in neural networks. 
It requires no arbitrary discretization of state or action 
spaces, and is amenable to gradient search techniques. 
The appropriateness of OED for exploration hinges on 
the two issues described in the previous two sections: the 
nature of the input constraints and the computational 
load one is able to bear. 

For learning problems with static constraints, the
advantage of applying OED, or any form of intelligent
active learning, appears to be problem dependent. Random
exploration appears to be reasonably good at decreasing 
variance, and as seen in Section 4.2, appears to decrease 
bias as well. For a problem where learner bias is likely to 
be a major factor, the advantages of the OED approach 
are unclear. 

The real advantage of the OED-based approach
appears to lie in problems where the input constraints are
dynamic, and where random actions fail to provide good
exploration. Compared with arbitrary heuristics, the
OED-based approach has the arguable advantage of
being the “right thing to do,” in spite of its computational
costs.

The cost, however, is a major drawback. A decision
time on the order of 1-10 seconds may be sufficient for
many applications, but is much too long to guide
real-time exploration of dynamical systems such as robotic
arms. The operations required for Hessian computation
and data selection may be efficiently parallelized;
the remaining computational expense lies in retraining
the network to incorporate each new example. The
retraining cost, which is common to all on-line neural
exploration algorithms, may be amortized by selecting
queries/actions in small batches rather than purely
sequentially. This “semi-batched” approach is a promising
direction for future work.

Another promising direction, which offers hope of
even greater speedups than the semi-batch approach, is
switching to an alternative, entirely non-neural learner
with which to pursue exploration.

6.1 Improving performance with alternative 
learners 

We may be able to bring down computational costs and
improve performance by using a different architecture
for the learner. With a standard feedforward neural
network, not only is the repeated computation of
variances expensive, it sometimes fails to yield estimates
suitable for use as confidence intervals (as we saw in
Section 4.1.2). A solution to both of these problems
may lie in selection of a more amenable architecture and
learning algorithm. Two such architectures, in which
output variances have a direct role in estimation, are
mixtures of Gaussians [McLachlan and Basford, 1988;
Nowlan, 1991; Ghahramani and Jordan, 1994] and
locally weighted regression [Cleveland et al., 1988; Schaal
and Atkeson, 1994]. Both have excellent statistical
modeling properties, and are computationally more tractable
than feedforward neural networks. We are currently
pursuing the application of optimal experiment design
techniques to these models and have observed encouraging
preliminary results [Cohn et al., 1994].

6.2 Active elimination of bias 

Regardless of which learning architecture is used, the
results in Section 4.2 make it clear that minimizing
variance alone is not enough. For large, data-poor problems,
variance will likely be the major source of error, but as
variance is removed (via the techniques described in this
paper), the bias will constitute a larger and larger
portion of the remaining error.

Bias is not as easily estimated as variance; it is
usually estimated by expensive cross validation, or by
running ensembles of learners in parallel (see, e.g., Geman
et al. [1992] and Connor [1993]). Future work will need
to include methods for efficiently estimating learner bias
and taking steps to ensure that it too is minimized in an
optimal manner.


Appendix: System identification with
neural networks and OED


System identification using OED has been successful on
tasks where the parameters of the unknown system are
explicit, but for a neural network model, system
identification is problematic. The weights in the network
cannot be reasonably considered to represent real
parameters of the unknown system being modeled, so there
is no good interpretation of their “identity.” A greater
problem is the observation that unless the network is
fortuitously structured to be exactly the correct size,
there will be extra unconstrainable parameters in the
form of unused weights, about which it will be
impossible to gain information. Distinguishing between
unconstrainable parameters (which we wish to delete or
ignore) and underconstrained parameters (about which we
wish to get more information) is an unsolved problem.

Below, we review the derivation of the “D-optimality”
criterion appropriate for system identification [Fedorov,
1972; MacKay, 1992], and briefly discuss experiments
selecting D-optimal data.


When doing system identification with a neural network,
we are interested in minimizing the covariance of the
parameter estimates $\hat{w}$. For the purposes of optimization,
it is convenient to express $\sigma^2_{\hat{w}}$ as a scalar. The most
widely used scalar is the determinant $D = |\sigma^2_{\hat{w}}|$, which
has an interpretation as the “volume” of parameter space
encompassed by the variance (for other approaches see
Atkinson and Donev [1992]).

The utility of querying at x, from a system identification
viewpoint, may be expressed in terms of the expected
change in the estimated value of D. The expected new
value $\tilde{D}$ is

$$\tilde{D} = |A^{-1}| \, \frac{S^2}{S^2 + g(x)^T A^{-1} g(x)} \tag{9}$$

which, by subtraction from the original estimate
$D = |A^{-1}|$, and writing $\sigma^2_{\hat{y}|x} = g(x)^T A^{-1} g(x)$
for the estimated output variance at x, gives

$$\Delta D = D \, \frac{\sigma^2_{\hat{y}|x}}{S^2 + \sigma^2_{\hat{y}|x}} \tag{10}$$

Equation 10 is maximized where $\sigma^2_{\hat{y}|x}$ is at a maximum,
giving the intuitively pleasing interpretation that for
system identification, parameter uncertainty is minimized
by querying where our uncertainty is largest. Such
queries are, in OED terminology, “D-optimal.”
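
The claim that the largest expected drop in D occurs where the output variance is largest can be checked numerically. The sketch below assumes that a query with output gradient $g$ adds a rank-one term $g g^T / S^2$ to $A$. Under that assumption the matrix determinant lemma gives a decrease of $D \, \sigma^2 / (S^2 + \sigma^2)$ with $\sigma^2 = g^T A^{-1} g$. The toy data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, noise_var = 5, 0.3
train = rng.normal(size=(40, n))          # stand-in gradients, one row per example
A = train.T @ train / noise_var
A_inv = np.linalg.inv(A)
D = np.linalg.det(A_inv)                  # "volume" of parameter uncertainty

def d_gain(g):
    """Expected decrease in D from one query with output gradient g."""
    var = g @ A_inv @ g                   # output variance at the query
    return D * var / (noise_var + var)

cands = rng.normal(size=(20, n))          # hypothetical candidate queries
gains = np.array([d_gain(g) for g in cands])
variances = np.array([g @ A_inv @ g for g in cands])
best = int(np.argmax(gains))              # the D-optimal candidate in this pool
```

Since the gain is a monotonically increasing function of the output variance, ranking candidates by variance alone selects the same query. This is also why an unconstrained search can drift toward extreme inputs, where the estimated variance keeps growing.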


Our experiments using the above criterion to select
training data had limited success. On regression problems
such as the arm kinematics the learner performed poorly,
attempting to select data at $x = \pm\infty$. These results are
consistent with the comments at the beginning of this
section, and with MacKay’s observation that, for
learning $X \to Y$ mappings, the system identification
criterion may be the “right solution to the wrong problem”
[MacKay, 1992]. The criterion addressed in Section 3,
also mentioned by MacKay and explored in greater detail
in this paper, appears to address the “right” problem.



